Aggregate Memory as an Intermediate Checkpoint Storage Device
Authors
Abstract
Applications that generate bursty I/O load, such as checkpointing, require additional support to perform efficiently on next-generation petascale supercomputers. Tens of thousands of processors, together generating terabytes of snapshot data at each timestep, can easily overwhelm a storage system. Further, even at the current peak I/O bandwidth offered by parallel file system deployments at leadership-class facilities, an application is likely to spend a significant portion of its runtime checkpointing. To address these issues, we propose a checkpoint storage device, built from memory resources, that acts as an intermediary to the central parallel file system. Our system comprises a dedicated manager that aggregates memory resources from processors (benefactors) and makes them available as a collective space for checkpointing clients through a standard POSIX file system interface. We argue that such a system has the potential to alleviate the I/O bandwidth bottleneck for bursty I/O operations like checkpointing by aggregating memory and interprocessor bandwidth.
Similar resources
A hardware MP3 decoder with low precision floating point intermediate storage
The effects of using limited precision floating point for intermediate storage in an embedded MP3 decoder are investigated in this thesis. The advantage of using limited precision is that the values need shorter word lengths and thus a smaller memory for storage. The official reference decoder was modified so that the effects of different word lengths and algorithms could be examined. Finally,...
An Implementation of Using Remote Memory to Checkpoint Processes
Process checkpointing is a procedure that periodically saves process state to stable storage. Most checkpointing facilities select hard disks for archiving. However, disk seek time is limited by the speed of the read-write heads, so checkpointing a process to a local disk requires extensive disk bandwidth. In this paper, we propose an approach that exploits the memory on idle work...
Rollback Recovery Scheme for Distributed Shared Memory Clusters
In this paper, a unified lightweight error recovery scheme based on coordinated checkpointing and rollback for distributed shared memory clusters is proposed. The new scheme maintains multiple globally consistent checkpoints of the state of a distributed shared memory cluster and recovers to a pre-fault checkpoint of the system. It also describes and evaluates the coordinated checkpointing. Th...
Managing Checkpoints for Parallel Programs
Checkpointing is a valuable tool for any scheduling system to have. With the ability to checkpoint, schedulers are not locked into a single allocation of resources to jobs, but instead can stop running jobs and re-allocate resources without sacrificing any completed computations. Checkpointing techniques are not new, but they have not been widely available on parallel platforms. We have implem...
A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States
Checkpoint/restart has been an effective mechanism to achieve fault tolerance for many long-running scientific applications. The common approach is to save computation states in memory and secondary storage for execution resumption. However, as the GPU plays a much bigger role in high performance computing, there is no effective checkpoint/restart scheme yet due to the difficulty of the GPU com...